{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ur8xi4C7S06n"
      },
      "outputs": [],
      "source": [
        "# Copyright 2024 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JAPoU8Sm5E6e"
      },
      "source": [
        "# Narrate a Multi-character Story with Gemini and Text-to-Speech\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/storytelling/storytelling.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Run in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Faudio%2Fspeech%2Fuse-cases%2Fstorytelling%2Fstorytelling.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Run in Colab Enterprise\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/audio/speech/use-cases/storytelling/storytelling.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/storytelling/storytelling.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "\n",
        "<div style=\"clear: both;\"></div>\n",
        "\n",
        "<b>Share to:</b>\n",
        "\n",
        "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/storytelling/storytelling.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/storytelling/storytelling.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/storytelling/storytelling.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/storytelling/storytelling.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/storytelling/storytelling.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
        "</a>            \n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "42cc0c827290"
      },
      "source": [
        "| Author |\n",
        "| --- |\n",
        "| [Holt Skinner](https://github.com/holtskinner) |"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tvgnzT1CKxrO"
      },
      "source": [
        "## Overview\n",
        "\n",
        "This notebook demonstrates how to use the [Gemini API in Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs) to generate a play script and create an audio performance with each character having a distinct voice using the [Text-to-Speech API](https://cloud.google.com/text-to-speech).\n",
        "\n",
        "The steps performed include:\n",
        "\n",
        "- Create a story using Gemini\n",
        "- Assign each character to a voice.\n",
        "- Synthesize each line based on character voice.\n",
        "- Combine audio files into one MP3 file."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aed92deeb4a0"
      },
      "source": [
        "### Costs\n",
        "\n",
        "This tutorial uses billable components of Google Cloud:\n",
        "\n",
        "* Gemini API in Vertex AI\n",
        "* Text-to-Speech\n",
        "* Cloud Storage\n",
        "\n",
        "Learn about [Text-to-Speech pricing](https://cloud.google.com/text-to-speech/pricing),\n",
        "and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),\n",
        "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n",
        "to generate a cost estimate based on your projected usage."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "i7EUnXsZhAGF"
      },
      "source": [
        "## Getting Started\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NNSWiCNPjh_p"
      },
      "source": [
        "### Install Google Gen AI SDK, other packages and their dependencies\n",
        "\n",
        "Install the following packages required to execute this notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2b4ef9b72d43"
      },
      "outputs": [],
      "source": [
        "%pip install --upgrade -q google-genai google-cloud-texttospeech pydub pandas tqdm"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J3a6juReciKR"
      },
      "source": [
        "If you're running on a Mac, you will need to install [FFmpeg](https://ffmpeg.org/)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "BzCUF4oqciKR"
      },
      "outputs": [],
      "source": [
        "%%bash\n",
        "# Check if the system is macOS\n",
        "if [[ \"$(uname -s)\" == \"Darwin\" ]]; then\n",
        "    # Install using Homebrew\n",
        "    brew install ffmpeg\n",
        "fi"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YbMFqPZ3tnwz"
      },
      "source": [
        "Set the project and region.\n",
        "\n",
        "* Please note the **available regions** for Text-to-Speech, see [documentation](https://cloud.google.com/text-to-speech/docs/endpoints)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "GjSsu6cmUdEx"
      },
      "outputs": [],
      "source": [
        "# Use the environment variable if the user doesn't provide Project ID.\n",
        "import os\n",
        "\n",
        "from google import genai\n",
        "\n",
        "PROJECT_ID = \"[your-project-id]\"  # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n",
        "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n",
        "    PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n",
        "\n",
        "LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"global\")\n",
        "\n",
        "client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)\n",
        "\n",
        "TTS_LOCATION = \"global\"  # @param {type:\"string\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "opUxT_k5TdgP"
      },
      "source": [
        "### Authenticating your notebook environment\n",
        "\n",
        "* If you are using **Colab** to run this notebook, run the cell below and continue.\n",
        "* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "vbNgv4q1T2Mi"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "\n",
        "# Additional authentication is required for Google Colab\n",
        "if \"google.colab\" in sys.modules:\n",
        "    # Authenticate user to Google Cloud\n",
        "    from google.colab import auth\n",
        "\n",
        "    auth.authenticate_user()\n",
        "\n",
        "! gcloud config set project {PROJECT_ID}\n",
        "! gcloud auth application-default set-quota-project {PROJECT_ID}\n",
        "! gcloud auth application-default login -q"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "960505627ddf"
      },
      "source": [
        "### Import libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "PyQmSRbKA8r-"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "import random\n",
        "import re\n",
        "\n",
        "from IPython.display import Audio\n",
        "from google.api_core.client_options import ClientOptions\n",
        "from google.cloud import texttospeech_v1beta1 as texttospeech\n",
        "from google.genai.types import GenerateContentConfig\n",
        "import pandas as pd\n",
        "from pydantic import BaseModel\n",
        "from pydub import AudioSegment\n",
        "from tqdm import tqdm"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v4gUI8WqciKS"
      },
      "source": [
        "### Define constants"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "id": "6Pl3un_YciKS"
      },
      "outputs": [],
      "source": [
        "DEFAULT_LANGUAGE = \"en\"\n",
        "NARRATOR_VOICE = \"en-GB-Chirp3-HD-Zubenelgenubi\"\n",
        "DEFAULT_VOICE = \"en-GB-Chirp3-HD-Umbriel\"\n",
        "\n",
        "SILENCE_LENGTH = 200  # In Milliseconds\n",
        "TXT_EXTENSION = \".txt\"\n",
        "\n",
        "api_endpoint = \"texttospeech.googleapis.com\"\n",
        "if TTS_LOCATION != \"global\":\n",
        "    api_endpoint = f\"{TTS_LOCATION}-{api_endpoint}\"\n",
        "\n",
        "tts_client = texttospeech.TextToSpeechClient(\n",
        "    client_options=ClientOptions(api_endpoint=api_endpoint)\n",
        ")\n",
        "\n",
        "SYSTEM_INSTRUCTION = \"\"\"You are a creative and ambitious play writer. Your goal is to write a play for audio performance. Include a narrator character to describe the scenes and actions occurring.\"\"\"\n",
        "\n",
        "MODEL_ID = \"gemini-2.5-flash\"\n",
        "\n",
        "\n",
        "class Character(BaseModel):\n",
        "    name: str\n",
        "    gender: str\n",
        "\n",
        "\n",
        "class DialogueLine(BaseModel):\n",
        "    speaker: str\n",
        "    line: str\n",
        "\n",
        "\n",
        "class Scene(BaseModel):\n",
        "    setting: str\n",
        "    dialogue: list[DialogueLine]\n",
        "\n",
        "\n",
        "class Story(BaseModel):\n",
        "    title: str\n",
        "    characters: list[Character]\n",
        "    scenes: list[Scene]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v5CEc4-Wrjk2"
      },
      "source": [
        "### Helper functions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {
        "id": "kYx2wwhjrmD6"
      },
      "outputs": [],
      "source": [
        "def list_voices(language_code: str = DEFAULT_LANGUAGE) -> list[dict]:\n",
        "    response = tts_client.list_voices(language_code=language_code)\n",
        "\n",
        "    return [\n",
        "        {\n",
        "            \"name\": voice.name,\n",
        "            \"gender\": texttospeech.SsmlVoiceGender(voice.ssml_gender).name.lower(),\n",
        "        }\n",
        "        for voice in response.voices\n",
        "        if (\"Chirp3\" in voice.name)\n",
        "        and voice.name != NARRATOR_VOICE\n",
        "        and \"en-IN\" not in voice.name\n",
        "    ]\n",
        "\n",
        "\n",
        "def create_character_map(\n",
        "    characters: list[Character], voices: list[dict]\n",
        ") -> dict[str, str]:\n",
        "    \"\"\"Maps characters to voices based on gender identified by Gemini.\"\"\"\n",
        "\n",
        "    if len(characters) > len(voices):\n",
        "        print(f\"Too many characters {len(characters)}. Max {len(voices)}\")\n",
        "\n",
        "    character_map: dict[str, str] = {}\n",
        "    male_voices = [voice[\"name\"] for voice in voices if voice[\"gender\"] == \"male\"]\n",
        "    female_voices = [voice[\"name\"] for voice in voices if voice[\"gender\"] == \"female\"]\n",
        "\n",
        "    for character in characters:\n",
        "        name = character.name\n",
        "        gender = character.gender.lower()\n",
        "\n",
        "        if name == \"Narrator\":\n",
        "            voice = NARRATOR_VOICE\n",
        "        elif gender == \"female\" and female_voices:\n",
        "            voice = female_voices.pop(random.randrange(len(female_voices)))\n",
        "        elif gender == \"male\" and male_voices:\n",
        "            voice = male_voices.pop(random.randrange(len(male_voices)))\n",
        "        else:\n",
        "            if male_voices and female_voices:\n",
        "                chosen_pool = random.choice([male_voices, female_voices])\n",
        "            elif male_voices:\n",
        "                chosen_pool = male_voices\n",
        "            elif female_voices:\n",
        "                chosen_pool = female_voices\n",
        "            else:\n",
        "                raise ValueError(\"Not enough voices to assign to all characters.\")\n",
        "\n",
        "            voice = chosen_pool.pop(random.randrange(len(chosen_pool)))\n",
        "\n",
        "        character_map[name] = voice\n",
        "\n",
        "    return character_map\n",
        "\n",
        "\n",
        "def synthesize_text(\n",
        "    file_prefix: str, file_index: str, text: str, voice_name: str\n",
        ") -> str:\n",
        "    output_file = f\"{file_prefix}-{file_index}.mp3\"\n",
        "\n",
        "    language_code = re.search(r\"\\b[a-z]{2}-[A-Z]{2}\\b\", voice_name).group()\n",
        "\n",
        "    response = tts_client.synthesize_speech(\n",
        "        input=texttospeech.SynthesisInput(text=text),\n",
        "        voice=texttospeech.VoiceSelectionParams(\n",
        "            language_code=language_code,\n",
        "            name=voice_name,\n",
        "        ),\n",
        "        audio_config=texttospeech.AudioConfig(\n",
        "            audio_encoding=texttospeech.AudioEncoding.MP3\n",
        "        ),\n",
        "    )\n",
        "\n",
        "    # The response's audio_content is binary.\n",
        "    with open(output_file, \"wb\") as f:\n",
        "        f.write(response.audio_content)\n",
        "    return output_file\n",
        "\n",
        "\n",
        "def combine_audio_files(audio_files: list[str], file_prefix: str) -> str:\n",
        "    full_audio = AudioSegment.silent(duration=SILENCE_LENGTH)\n",
        "\n",
        "    for file in audio_files:\n",
        "        sound = AudioSegment.from_mp3(file)\n",
        "        silence = AudioSegment.silent(duration=SILENCE_LENGTH)\n",
        "        full_audio += sound + silence\n",
        "        os.remove(file)\n",
        "\n",
        "    outfile_name = f\"{file_prefix}-complete.mp3\"\n",
        "    full_audio.export(outfile_name, format=\"mp3\")\n",
        "    return outfile_name\n",
        "\n",
        "\n",
        "def generate_audio_clips(\n",
        "    story: Story, character_map: dict[str, str]\n",
        ") -> tuple[list[str], str]:\n",
        "    file_prefix = re.sub(r\"[^\\w.-]\", \"_\", story.title).lower()\n",
        "    output_files: list[str] = []\n",
        "\n",
        "    lines: list[dict] = [\n",
        "        {\n",
        "            \"line\": story.title,\n",
        "            \"voice\": character_map.get(\"Narrator\", NARRATOR_VOICE),\n",
        "        }\n",
        "    ]\n",
        "\n",
        "    # Process each scene in the play\n",
        "    for scene in story.scenes:\n",
        "        # Add the scene setting with the Narrator's voice\n",
        "        lines.append(\n",
        "            {\n",
        "                \"line\": \"Setting... \" + scene.setting,\n",
        "                \"voice\": character_map.get(\"Narrator\", NARRATOR_VOICE),\n",
        "            }\n",
        "        )\n",
        "\n",
        "        # Process each dialogue in the scene\n",
        "        for dialogue in scene.dialogue:\n",
        "            lines.append(\n",
        "                {\n",
        "                    \"line\": dialogue.line,\n",
        "                    \"voice\": character_map.get(dialogue.speaker, DEFAULT_VOICE),\n",
        "                }\n",
        "            )\n",
        "\n",
        "    for file_index, line in tqdm(enumerate(lines, start=1), \"Generating audio clips\"):\n",
        "        output_files.append(\n",
        "            synthesize_text(file_prefix, file_index, line[\"line\"], line[\"voice\"])\n",
        "        )\n",
        "\n",
        "    return output_files, file_prefix"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4fa56795fdeb"
      },
      "source": [
        "## Generate play with Gemini"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "476afe69c41f"
      },
      "outputs": [],
      "source": [
        "PROMPT = \"\"\"Write an interesting and humorous version of the play Macbeth by William Shakespeare.\"\"\"\n",
        "\n",
        "response = client.models.generate_content(\n",
        "    model=MODEL_ID,\n",
        "    contents=PROMPT,\n",
        "    config=GenerateContentConfig(\n",
        "        system_instruction=SYSTEM_INSTRUCTION,\n",
        "        max_output_tokens=65535,\n",
        "        temperature=1.5,\n",
        "        top_p=0.95,\n",
        "        response_mime_type=\"application/json\",\n",
        "        response_schema=Story,\n",
        "    ),\n",
        ")\n",
        "story = response.parsed"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ed3f9e28f5a9"
      },
      "source": [
        "Alternatively, load a pre-generated play."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "id": "c3f5cac8983e"
      },
      "outputs": [],
      "source": [
        "with open(\"macbeth_the_sitcom.json\") as f:\n",
        "    story = Story.model_validate_json(f.read())"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4e67586ace6c"
      },
      "outputs": [],
      "source": [
        "story"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F4NR_30rfuhA"
      },
      "source": [
        "## Get available English voices"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0KqGXuRVuBDf"
      },
      "outputs": [],
      "source": [
        "all_voices = list_voices()\n",
        "pd.DataFrame(all_voices)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LKMhnJ6mf6bb"
      },
      "source": [
        "## Assign voices to characters"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "otMX_-lfgIG1"
      },
      "outputs": [],
      "source": [
        "character_to_voice = create_character_map(story.characters, all_voices)\n",
        "character_to_voice"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "p-laS59bgKo5"
      },
      "source": [
        "## Send play text to Text-to-Speech and output each line as an audio file\n",
        "\n",
        "The Text-to-Speech API can only create audio with one voice per API call, so we need to create separate files for each line."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "L6rYY43xgLwh"
      },
      "outputs": [],
      "source": [
        "output_files, file_prefix = generate_audio_clips(story, character_to_voice)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5tAMk2RbgO-C"
      },
      "source": [
        "## Combine audio files into a single file\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xl0ZnHStgRg3"
      },
      "outputs": [],
      "source": [
        "outfile_name = combine_audio_files(output_files, file_prefix)\n",
        "print(f\"Audio content written to file {outfile_name}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ce553e1efe43"
      },
      "source": [
        "## Listen to the audio"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 35,
      "metadata": {
        "id": "d_ax6LxzciKT"
      },
      "outputs": [],
      "source": [
        "Audio(outfile_name)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "storytelling.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
